library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(wordcloud2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071)

Group members

Yuxuan Du: yd2739

Tara Zhan: fz2377

Yaduo Wang: yw4199

Chenshuo Pan: cp3384

Wenxin Tian: wt2369

Project motivation

Amidst the burgeoning gaming industry, popular belief posits genres such as shooters, sports, and racing as perennial favorites. As gamers, our team believes there may be a “success formula” that predicts whether a video game will succeed. Our avid gaming experience on STEAM has piqued our interest in this hypothesis. Our objective is to examine the veracity of such claims and identify characteristics that signal the popularity of STEAM games.

Anticipated data sources

https://www.kaggle.com/datasets/fronkongames/steam-games-dataset

Data import and cleaning

We apply the following inclusion criteria to our analysis:

We will exclude free games from our analysis. This decision is based on the fact that in free games, in-app purchases often serve as alternatives to traditional game sales. However, since we lack access to data regarding these in-app purchases, including such games in our analysis would be both impractical and imprecise.

We will remove columns that contain URLs, as they do not provide useful information for our analysis.

Additionally, we will filter out variables related to developers and publishers for two reasons: Firstly, we believe it is challenging, if not impossible, to make predictions using these features. Secondly, we assume that brand effects should be reflected through a combination of game features and keywords, thus including most of the brand impact.

Furthermore, we will exclude the average and median playtime over the past two weeks, since a two-week window is not representative for older games. We also prefer median playtime forever over average playtime forever, because AFK (away-from-keyboard) players inflate the average.

Lastly, we won’t include the ‘Full Audio Language’ variable, as some games may not contain dialogue, making it irrelevant for our analysis.

We will further solve the data leakage problem and multicollinearity when we fit the model.

df = read.csv("DATA/games.csv")|>
  janitor::clean_names()|>
  subset(select = -c(dlc_count, about_the_game, reviews, header_image, website, support_url, support_email, metacritic_score, metacritic_url, notes, developers, publishers, screenshots, movies, score_rank, average_playtime_two_weeks, median_playtime_two_weeks, full_audio_languages, average_playtime_forever, peak_ccu))|>
  subset(price > 0)

Change the format of release date.

df = df|>
  mutate(release_date = as.Date(release_date, format = "%b %d, %Y"))
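A quick, self-contained check of the format string (note that `%b` month abbreviations are locale-dependent, so this assumes an English locale):

```r
# "%b %d, %Y" matches raw values like "Oct 21, 2008"
parsed = as.Date("Oct 21, 2008", format = "%b %d, %Y")
parsed  # "2008-10-21" in an English locale; NA otherwise
```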

EDA

Price distribution of games with different estimated owners

# we filter out Ascent Free-Roaming VR Experience since its price is 1000 dollars, which would otherwise make it hard to see any trend in our plot. 
df|>
  subset(name != "Ascent Free-Roaming VR Experience")|>
  separate_wider_delim(cols = estimated_owners, delim = " - ", names = c("estimate_low", "estimate_high"))|>
  mutate(estimate_low = as.numeric(estimate_low))|>
  mutate(estimate_high = as.numeric(estimate_high))|>
  mutate(estimate_mean = (estimate_high + estimate_low)/2)|>
  mutate(estimate_mean = factor(estimate_mean))|>
  ggplot(aes(x = estimate_mean, y = price))+
  geom_boxplot()+
  coord_flip()

We can see that games with more owners tend to have comparatively higher prices, but beyond a certain number of owners, prices start to decrease.

Overall frequency of keywords (genres, categories, tags)

categories_freq = df|>
  separate_rows(categories, sep = ",")|>
  group_by(categories)|>
  summarise(n_obs = n())|>
  arrange(desc(n_obs))|>
  ggplot(aes(x = reorder(categories, +n_obs), y = n_obs))+
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
  coord_flip()

categories_freq

genre_freq = df|>
  separate_rows(genres, sep = ",")|>
  group_by(genres)|>
  summarise(n_obs = n())|>
  arrange(desc(n_obs))|>
  ggplot(aes(x = reorder(genres, +n_obs), y = n_obs))+
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
  coord_flip()

genre_freq

tags_freq = df|>
  separate_rows(tags, sep = ",")|>
  group_by(tags)|>
  summarise(n_obs = n())|>
  subset(n_obs >= 5000)|>
  arrange(desc(n_obs))|>
  ggplot(aes(x = reorder(tags, +n_obs), y = n_obs))+
  geom_bar(stat="identity")+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1)) +
  coord_flip()

tags_freq

Trend of keyword frequency across different years

It might be informative to visualize how the mix of genres developed changes across years, since WoW might be a good game for the 2000s, while games like Overwatch might be better suited to the 2020s.

genre_freq_year = df|>
  mutate(year = year(release_date))|>
  separate_rows(genres, sep = ",")|>
  group_by(year, genres)|>
  summarise(n_obs = n())|>
  group_by(year)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
genre_freq_year_total = df|>
  mutate(year = year(release_date))|>
  separate_rows(genres, sep = ",")|>
  group_by(year)|>
  summarise(n_obs_total = n())
  

genre_freq_year_total|>
  knitr::kable()
| year | n_obs_total |
|-----:|------------:|
| 1997 |           3 |
| 1998 |           1 |
| 1999 |           2 |
| 2000 |           2 |
| 2001 |           7 |
| 2002 |           2 |
| 2003 |           7 |
| 2004 |           7 |
| 2005 |          10 |
| 2006 |          79 |
| 2007 |         118 |
| 2008 |         230 |
| 2009 |         477 |
| 2010 |         419 |
| 2011 |         477 |
| 2012 |         659 |
| 2013 |        1011 |
| 2014 |        3312 |
| 2015 |        6109 |
| 2016 |        9853 |
| 2017 |       14823 |
| 2018 |       19051 |
| 2019 |       18579 |
| 2020 |       22777 |
| 2021 |       27694 |
| 2022 |       30735 |
| 2023 |       19446 |
| NA   |         277 |

We can see from the table that there are very few games on Steam before 2006; genre frequency could be strongly affected by randomness in those years, so we choose to plot the trend only after 2005.

# here we limit the years to after 2005; as the table shows, there are too few games before 2006 for genre ratios to be representative. 
left_join(genre_freq_year, genre_freq_year_total, by = "year")|>
  subset(year > 2005)|>
  mutate(genre_ratio = n_obs/n_obs_total)|>
  ungroup()|>
  plot_ly(x = ~year, y = ~genre_ratio, color = ~genres, type = "scatter", mode = "lines+markers",  colors = "viridis")
popular_genres_bar_plot = left_join(genre_freq_year, genre_freq_year_total, by = "year")|>
  subset(year > 2005)|>
  mutate(genre_ratio = n_obs/n_obs_total)|>
  ggplot(aes(x = year, y = genre_ratio, fill = genres)) + 
  geom_col(position = 'stack', width = 0.6)+
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1))
popular_genres_bar_plot

Observation: to be added.

Word cloud visualization

Word clouds help visualize the frequency of keywords in popular and unpopular games.

overwhelm_popular_wordcloud = df|>
  subset((positive+negative) > 500)|>
  subset(positive/(positive+negative) > 0.95)|>
  separate_rows(tags, sep = ",")|>
  group_by(tags)|>
  summarise(n_obs = n())|>
  wordcloud2()

# df_popular / df_unpopular were not defined above; we construct them here
# using the same popularity criteria applied in the model-fitting section.
df_popular = df|>
  subset((positive + negative) > 10)|>
  subset(positive/(positive + negative) > 0.8)|>
  subset(estimated_owners != "0 - 20000")|>
  subset(median_playtime_forever > 120)

df_unpopular = anti_join(df, df_popular, by = "app_id")

popular_tags_wordcloud = df_popular|>
  separate_rows(tags, sep = ",")|>
  group_by(tags)|>
  summarise(n_obs = n())|>
  wordcloud2()

unpopular_tags_wordcloud = df_unpopular|>
  separate_rows(tags, sep = ",")|>
  group_by(tags)|>
  summarise(n_obs = n())|>
  wordcloud2()
overwhelm_popular_wordcloud
popular_tags_wordcloud
unpopular_tags_wordcloud

We can observe differences among the three word clouds above.

Firstly, casual games are prevalent among popular games but not as common in unpopular games. This could be because casual games attract a larger user base, increasing the likelihood of meeting the popularity criteria.

Simultaneously, we notice that the tag “difficult” appears more frequently in the word cloud for popular games. This aligns with reality, as some of the best games in history, such as Elden Ring and Sekiro: Shadows Die Twice, are known for their challenging gameplay.

An interesting observation is that the ratio of 2D games to 3D games is higher in popular games compared to unpopular ones. This is reasonable since plot and game mechanics often take precedence in game development. Small studios may struggle to allocate sufficient resources to create high-quality 3D environments, prompting them to opt for 2D games more frequently. This contributes to the higher frequency of 2D games in the popular category.

Furthermore, anime and cute are tags that exhibit a higher frequency in popular games. The exact reason for this trend is uncertain, but we plan to conduct a more in-depth analysis of games associated with these tags.

Model fitting

Preventing data leakage

Since it is hard for models to analyze keywords directly as strings, we one-hot encode categories, genres, and tags, which makes the dataset easier for models to consume.

We noticed that some tags might cause data leakage, so we filter them out. These tags include: Masterpiece, Great Soundtrack, Addictive, Benchmark, and Classic. They are conclusive labels awarded after release, not attributes a game producer can choose in advance.

df_concat = df|>
  mutate(keywords = paste(categories, genres, tags, sep = ","))|>
  subset(select = -c(categories, genres, tags))
keywords_df = df_concat|>
  subset(select = c(app_id, keywords))|>
  separate_rows(keywords, sep = ",")|>
  distinct(app_id, keywords, .keep_all = TRUE)|>
  mutate(value = 1)|>
  subset(keywords != "")|>
  pivot_wider(names_from = keywords, values_from = value, values_fill = 0)

one_hot_encoded_df = left_join(df_concat, keywords_df, by = "app_id")
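The `separate_rows`/`pivot_wider` pattern above is easier to see on a toy frame (a minimal sketch; the two-game `toy` tibble is made up for illustration):

```r
library(dplyr)
library(tidyr)

# hypothetical two-game frame, for illustration only
toy = tibble::tibble(
  app_id   = c(1, 2),
  keywords = c("rpg,indie", "indie")
)

toy_encoded = toy |>
  separate_rows(keywords, sep = ",") |>  # one row per (game, keyword)
  mutate(value = 1) |>                   # presence indicator
  pivot_wider(names_from = keywords, values_from = value, values_fill = 0)

toy_encoded
# → columns app_id, rpg, indie; game 2 gets rpg = 0, indie = 1
```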

popular_encoded_df = one_hot_encoded_df|>
  subset((positive+negative) > 10)|>
  subset(positive/(positive+negative) > 0.8)|>
  subset(estimated_owners != "0 - 20000")|>
  subset(median_playtime_forever > 120)|>
  mutate(popular = 1)

unpopular_encoded_df = anti_join(one_hot_encoded_df, popular_encoded_df, by="app_id")|>
  mutate(popular = 0)

encoded_with_label_df = rbind(popular_encoded_df, unpopular_encoded_df)|>
  janitor::clean_names()|>
  subset(select = -c(positive, negative, median_playtime_forever, estimated_owners, recommendations, user_score,  supported_languages, keywords, addictive, masterpiece, great_soundtrack, benchmark, classic))|>
  mutate(windows = as.integer(as.logical(windows)))|>
  mutate(mac = as.integer(as.logical(mac)))|>
  mutate(linux = as.integer(as.logical(linux)))

dataset_without_id_name = encoded_with_label_df|>
  subset(select = -c(app_id, name, release_date))|>
  drop_na()

Feature selection

There are tags with hierarchical relationships (for example, RPG vs. tactical_rpg and party_based_rpg); we want to find such related variables and remove some of them.

find_col_pairs = function(df) {
  # For each ordered pair of columns (i, j), flag column i as a "sub-group"
  # of column j when every row with a 1 in column i also has a 1 in column j,
  # i.e. sum(col_i * col_j) == sum(col_i).
  n = ncol(df)
  result = character(0)
  
  for (i in 1:n) {
    for (j in 1:n) {
      col_i = df[, i]
      col_j = df[, j]
      dot_product = sum(col_i * col_j)
      
      # col_i must be a non-empty subset of col_j, with i != j
      if (sum(col_i) == dot_product & dot_product != 0 & i != j & sum(col_i) != 0) {
        result = c(result, paste(colnames(df)[i], colnames(df)[j], sep = "-"))
      }
    }
  }
  
  return(result)
}

subgroup_features = find_col_pairs(dataset_without_id_name)

From the result, most features are sub-groups of windows, single_player, and multi_player, which is expected since most games fall into these categories. We decide to remove these features since they might affect our model. We also found that the tags batman, fox, birds, football_american, tile_matching, and tracked_motion_controller_support are present only once, and coding appears only twice; to make our model more generalizable, we drop these columns as well.
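The subset test inside `find_col_pairs` can be sanity-checked on two toy indicator vectors (made up for illustration):

```r
# toy one-hot columns: every "tactical_rpg" game is also an "rpg" game
rpg          = c(1, 1, 1, 0)
tactical_rpg = c(1, 0, 1, 0)

# tactical_rpg is a sub-group of rpg exactly when the dot product
# equals tactical_rpg's own count of ones
is_subgroup = sum(tactical_rpg * rpg) == sum(tactical_rpg)
is_subgroup  # TRUE
```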

dataset_without_id_name = dataset_without_id_name|>
  subset(select = -c(windows, single_player, multi_player, multiplayer, birds, football_american, fox, batman, coding, tile_matching,tracked_motion_controller_support))

Lasso regression

Here, we choose lasso regression because it performs feature selection automatically.

#first, we shuffle the rows and split train and test sets; features and target are separated afterwards
dataset_without_id_name= dataset_without_id_name[sample(1:nrow(dataset_without_id_name)), ] 
dataset_without_id_name$id = 1:nrow(dataset_without_id_name)
train = dataset_without_id_name |>
  sample_frac(0.70)
test = anti_join(dataset_without_id_name, train, by = 'id')

train = train|>
  subset(select = -c(id))

test = test|>
  subset(select = -c(id))

train_x = train|>
  subset(select = -c(popular))
train_y = train|>
  pull(popular)

test_x = test|>
  subset(select = -c(popular))

test_y = test|>
  pull(popular)

set.seed(1234)
foldid = sample(1:5, size = nrow(train_x), replace = TRUE)

lambda = 10^(seq(2, -5, -0.1))

lasso_fit = glmnet(
  x = as.matrix(train_x), 
  y = train_y, 
  lambda = lambda,
  alpha=1, 
  family = "binomial"
  )

lasso_cv = cv.glmnet(
  x = as.matrix(train_x), 
  y = train_y, 
  lambda = lambda, 
  foldid = foldid,
  alpha=1, 
  family = "binomial"
)

lambda_opt = lasso_cv$lambda.min
broom::tidy(lasso_fit) |>
  select(term, lambda, estimate) |>
  complete(term, lambda, fill = list(estimate = 0) ) |>
  filter(term != "(Intercept)") |>
  ggplot(aes(x = log(lambda, 10), y = estimate, group = term, color = term)) + 
  geom_path() + 
  geom_vline(xintercept = log(lambda_opt, 10), color = "blue", linewidth = 1.2) +
  theme(legend.position = "none")
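Besides `lambda.min`, `cv.glmnet` also reports `lambda.1se`, the largest lambda whose cross-validated error is within one standard error of the minimum; it yields a sparser, more conservative model and could be worth comparing here. A self-contained sketch on toy data (the simulated `x`/`y` are stand-ins for our training set):

```r
library(glmnet)

set.seed(1234)
x = matrix(rnorm(100 * 5), 100, 5)      # toy design matrix
y = rbinom(100, 1, plogis(x[, 1]))      # outcome driven by the first column

cv = cv.glmnet(x, y, alpha = 1, family = "binomial", nfolds = 5)
c(cv$lambda.min, cv$lambda.1se)         # lambda.1se >= lambda.min by construction
```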

result = predict(lasso_fit, s = lambda_opt, newx = as.matrix(test_x), type = 'response')
result = as.data.frame(result)|>
  mutate(prediction = ifelse(s1 > 0.5, 1, 0))|>
  mutate(actual = test_y)|>
  mutate(difference = ifelse(prediction != actual, 1, 0))
acc = (nrow(result) - sum(pull(result, difference)))/nrow(result)

result = result|>
  mutate(actual = factor(actual, levels = c(1, 0)))|>
  mutate(prediction = factor(prediction, levels = c(1, 0)))
confusionMatrix(data=pull(result, prediction), reference = pull(result, actual))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     1     0
##          1   305   170
##          0   884 17630
##                                           
##                Accuracy : 0.9445          
##                  95% CI : (0.9411, 0.9477)
##     No Information Rate : 0.9374          
##     P-Value [Acc > NIR] : 2.076e-05       
##                                           
##                   Kappa : 0.3431          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.25652         
##             Specificity : 0.99045         
##          Pos Pred Value : 0.64211         
##          Neg Pred Value : 0.95225         
##              Prevalence : 0.06262         
##          Detection Rate : 0.01606         
##    Detection Prevalence : 0.02501         
##       Balanced Accuracy : 0.62348         
##                                           
##        'Positive' Class : 1               
## 

From the confusion matrix, we can see that our model fails to recognize many popular games (sensitivity is only about 0.26). However, the number of false positives is acceptable; this suggests there are recognizable patterns among unpopular games, and such a model may still yield meaningful insight by identifying games that will not be popular.
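Given the class imbalance, a threshold-independent metric complements the confusion matrix. `pROC` is loaded at the top but unused; a minimal sketch on simulated scores (the `labels`/`scores` vectors are stand-ins — substitute `test_y` and the lasso probabilities in `result`):

```r
library(pROC)

set.seed(1234)
labels = rbinom(500, 1, 0.06)                    # ~6% positives, like our test set
scores = plogis(rnorm(500, mean = 2 * labels))   # scores correlated with labels

roc_obj = roc(response = labels, predictor = scores, quiet = TRUE)
auc(roc_obj)     # area under the ROC curve
plot(roc_obj)    # sensitivity vs. specificity across thresholds
```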

Most important tags and their effect

First, we extract the coefficients (betas) from our model.

betas = coef(lasso_cv, s = lambda_opt)

beta_summary =  summary(betas)

beta_df = data.frame(tags = rownames(betas)[beta_summary$i],
           Weight = beta_summary$x)|>
  mutate(abs_weight = abs(Weight))|>
  arrange(desc(abs_weight))
beta_df|>
  subset(tags != "(Intercept)")|>
  head(30)|>
  ggplot(aes(x = reorder(tags, +abs_weight), y = Weight))+
  geom_bar(stat="identity")+
  coord_flip()

The most influential tags and the directions of their effects look reasonable.